Skip to content

Fast-start local emulator via RAM snapshot + live secret rotation#1340

Merged
BilalG1 merged 27 commits intodevfrom
local-emulator-qol-fixes
Apr 20, 2026
Merged

Fast-start local emulator via RAM snapshot + live secret rotation#1340
BilalG1 merged 27 commits intodevfrom
local-emulator-qol-fixes

Conversation

@BilalG1
Copy link
Copy Markdown
Collaborator

@BilalG1 BilalG1 commented Apr 15, 2026

Summary

stack emulator start now resumes a fully-warm VM snapshot instead of cold-booting, bringing startup from 30–120s down to ~5–8s with per-install secret rotation, or ~2.5s with rotation opt-out. The snapshot is captured locally on first stack emulator pull, not shipped from CI — QEMU migration state isn't portable across accelerators (KVM/HVF/TCG) or -cpu max feature sets, so a CI-captured snapshot couldn't resume reliably on arbitrary user hardware.

Also bundles a pile of CLI QoL fixes (progress bars, PR/run artifact pulls, PR-build download, native-TS ISO writer replacing hdiutil/mkisofs/genisoimage host dep, unit tests).

Scenario Before After
Cold boot (no snapshot) 30–120s same, works as fallback
stack emulator pull (one-time, includes local snapshot capture) ~30s download ~30s download + ~1–3 min cold-boot capture
Snapshot resume, normal start ~5–8s
Snapshot resume, EMULATOR_NO_ROTATION=1 ~2.5s

Backend (/health?db=1) and dashboard (/handler/sign-in) return 200 on all paths. Two successive snapshot resumes produce different rotated PCK/SSK/SAK/CRON_SECRET values per install.

How it works

Build (CI)docker/local-emulator/qemu/build-image.sh:

  1. Cloud-init provisioning runs to completion (migrations, seed, slim-image) producing stack-emulator-<arch>.qcow2.
  2. Image is built with a topology compatible with later snapshot capture (pinned SMP=4, phantom seed/bundle ISOs, STACKCFG runtime ISO mounted at build time, qemu-guest-agent running, placeholder hex secrets baked in under STACK_EMULATOR_BUILD_SNAPSHOT=1).
  3. CI publishes only the qcow2 — no .savevm.zst ships.

Pull (user's machine)packages/stack-cli/src/commands/emulator.ts + run-emulator.sh capture:

  1. stack emulator pull downloads the qcow2 with a progress bar (or from a PR / workflow run via --pr / --run).
  2. CLI invokes run-emulator.sh capture: cold-boots the qcow2 with a matching device layout (phantom ISOs, fsdev, pcie-root-port, virtfs detached — migration-incompatible), waits for backend+dashboard health, then drives QMP: stop → set mapped-ram + multifd caps → migrate file:state.raw → poll query-migratequit. Raw mapped-ram file is zstd-compressed to stack-emulator-<arch>.savevm.zst in the images dir.
  3. --skip-snapshot opts out (first start will then cold-boot).

Runtimerun-emulator.sh start:

  1. Launch QEMU with -incoming defer when a .savevm.zst is present; decompress on first use, keep the .raw cached for subsequent starts.
  2. QMP: same mapped-ram + multifd caps → migrate-incoming file:<.raw> → poll for pausedcont.
  3. Generate fresh per-install secrets on the host; pipe them base64-encoded through QGA guest-exec input-datatrigger-fast-rotate in the guest → docker exec -e … rotate-secrets.
  4. rotate-secrets in the container: validate keys (hex-only), targeted sed on the placeholder PCK across built JS, UPDATE ApiKeySet, supervisorctl restart stack-app cron-jobs (with stopasgroup/killasgroup so the Node children actually die and release their ports).
  5. Poll backend+dashboard health; if anything fails, clean up and fall back to cold boot transparently.

Security model: placeholder hex values are baked into the snapshot (00…ff PCK, 00…ee SSK, 00…dd SAK, 00…cc CRON_SECRET). They are non-secret by construction. Real per-install secrets are generated at each emulator start and never leave the host.

CLI changes (packages/stack-cli)

  • src/lib/iso.ts (new): native TypeScript ISO 9660 + Joliet writer, replacing the host-side hdiutil/mkisofs/genisoimage dependency for generating the STACKCFG runtime config disk. Unit tests in src/lib/iso.test.ts.
  • src/commands/emulator.ts:
    • pull: streamed downloads with progress bar + ETA; --pr <number> and --run <id> to pull from a PR build's CI artifacts (uses extract-zip for the nested zip); --skip-snapshot to opt out of the one-time local capture.
    • start (existing, extended): auto-pulls AND auto-captures when no image exists, so first-ever start is self-bootstrapping; emits STACK_EMULATOR_CLI_WROTE_ISO=1 so the shell helper skips its own ISO regen (avoids the genisoimage host dep).
    • capture (new, invoked by pull and the auto-pull path of start): drives the local snapshot capture via run-emulator.sh.
    • status, stop, reset, list-releases: preflight + path-resolution tightening (STACK_EMULATOR_HOME → images/run dirs).
    • Unit tests in src/commands/emulator.test.ts.
  • EMULATOR_NO_ROTATION=1 env var skips the post-resume rotation (intended for tests/CI where the placeholder secrets are fine — comes with a loud warning).

CI (.github/workflows/qemu-emulator-build.yaml)

  • Builds QEMU 10.2.2 from source (cached), because mapped-ram/multifd migration capabilities aren't available in the distro's QEMU. Enables KVM on ubicloud runners so amd64 boots at hardware speed.
  • amd64 + arm64 both build on the same amd64 matrix (ubicloud-standard-8); arm64 runs under cross-arch TCG (provisioning only — boot/verify smoke test is amd64-only).
  • Verification now runs through the CLI: emulator startemulator statusemulator stop against the freshly-built qcow2 (via STACK_EMULATOR_HOME pointing at the workspace, so the CLI doesn't silently auto-pull a prior release).
  • Packages only the qcow2. No .savevm.zst upload / publish.
  • Release notes updated.

Key files

Shell / guest:

  • docker/local-emulator/qemu/build-image.sh — snapshot-compatible device topology + STACKCFG runtime ISO at build time
  • docker/local-emulator/qemu/run-emulator.shstart, capture, stop, reset, status; -incoming defer, .raw cache, QGA-driven rotation, cold-boot fallback
  • docker/local-emulator/qemu/common.sh (new) — shared qmp_session + capture_vm_state (factored out so build-image.sh and run-emulator.sh share the capture path)
  • docker/local-emulator/qemu/cloud-init/emulator/user-data — placeholder secrets in snapshot mode, wait-for-stack-ready, trigger-fast-rotate, qemu-guest-agent enabled
  • docker/local-emulator/rotate-secrets.sh (new) — in-container rotation (sed + UPDATE + supervisorctl)
  • docker/local-emulator/supervisord.confstopasgroup/killasgroup on stack-app and cron-jobs
  • docker/local-emulator/entrypoint.sh — only mint CRON_SECRET if unset (placeholder supplied in snapshot mode via --env-file)
  • docker/local-emulator/Dockerfile — ships rotate-secrets to /usr/local/bin
  • docker/server/entrypoint.sh — source /run/stack-auth/rotated-secrets.env; skip full-tree sentinel scan on warm restarts via marker

CLI:

  • packages/stack-cli/src/lib/iso.ts (new) + iso.test.ts (new)
  • packages/stack-cli/src/commands/emulator.ts + emulator.test.ts (new)
  • packages/stack-cli/vitest.config.ts (new)

CI:

  • .github/workflows/qemu-emulator-build.yaml

Test plan

  • docker/local-emulator/qemu/build-image.sh {amd64,arm64} produces stack-emulator-<arch>.qcow2 with snapshot-compatible topology
  • stack emulator pull downloads qcow2 with progress, then captures locally (~1–3 min) and writes stack-emulator-<arch>.savevm.zst in the images dir
  • stack emulator pull --skip-snapshot stops after download
  • stack emulator pull --pr <n> / --run <id> pull from PR / workflow run artifacts
  • stack emulator start on a fresh dir auto-pulls and auto-captures, then starts; subsequent starts fast-resume in ~5–8s; backend + dashboard return 200
  • EMULATOR_NO_ROTATION=1 stack emulator start completes in ~2.5s; backend + dashboard return 200 with warning printed
  • Two consecutive emulator start invocations produce different PCK values in the internal ApiKeySet row
  • stack emulator status / stop / reset resolve paths from STACK_EMULATOR_HOME
  • Verified end-to-end on arm64 macOS under HVF (capture ~50s, fast-resume ~6.5s)
  • pnpm lint and pnpm typecheck pass; stack-cli unit tests (iso + emulator) pass
  • CI green on this PR (qemu-emulator-build matrix, smoke test)
  • gh release download emulator-<branch>-latest contains only stack-emulator-<arch>.qcow2 once this PR merges and publish runs

Summary by CodeRabbit

  • New Features

    • Snapshot fast-start/resume with optional warm-snapshot assets, runtime ISO generation, and a cached QEMU build to speed emulator setup.
    • CLI: streamed artifact downloads with progress, improved release/asset handling, stronger preflight checks, and start/status/stop emulator commands.
    • Automated secret rotation and ability to apply rotated secrets at container startup; supervisor control socket enabled.
  • Bug Fixes

    • More robust start/stop/resume flows with automatic fallback to cold boot and improved process-group shutdown behavior.
  • Tests

    • New tests for CLI utilities and ISO image generation.

BilalG1 added 3 commits April 14, 2026 19:19
Ships a compressed RAM/device snapshot (stack-emulator-<arch>.savevm.zst)
alongside the qcow2. `emulator start` resumes from it and rotates the
per-install secrets in place, taking cold-boot from 30-120s to ~6-7s.

Build phase adds a STACKCFG runtime ISO so stack.service can boot during
image creation, starts qemu-guest-agent so its virtio-serial port stays
open in the snapshot, then stop+migrate file:+quit via QMP.

Runtime sends fresh secrets through QGA guest-exec input-data, which pipes
them to trigger-fast-rotate and rotate-secrets inside the container:
targeted sed on the placeholder PCK in built JS, UPDATE on the internal
ApiKeySet, supervisorctl restart stack-app + cron-jobs. Placeholder hex
values are baked in instead of random keys under STACK_EMULATOR_BUILD_SNAPSHOT=1
so no real secret ships in the snapshot.

Device topology and SMP must match at capture and resume; runtime adds
phantom seed/bundle drives and pins SMP=4. Cold-boot fallback kicks in
automatically when the snapshot is missing, corrupt, or incompatible.

supervisord.conf now uses stopasgroup/killasgroup for stack-app and
cron-jobs so supervisor restart actually kills the Node children (they
were keeping their port bindings and breaking rotation).
Snapshot resume drops from ~14s to ~5-7s with rotation, ~2.5s without.

Build uses QEMU's mapped-ram + multifd migration capability so the RAM
state is written at page-aligned offsets in a sparse file. Runtime
decompresses the shipped .savevm.zst once to a local .raw cache and
reloads via -incoming file: + migrate-incoming on subsequent starts,
avoiding the per-start zstd decode.

Adds EMULATOR_NO_ROTATION=1 for tests/CI that don't mind the placeholder
secrets; saves the full ~3s rotation window.

Misc runtime cleanups: tighter QMP/QGA poll intervals (1s → 0.2s),
shorter socat keep-alive windows, 1s settle before the post-rotation
health-check to avoid racing old Node processes, fallback path preserves
the CLI-generated runtime-config.iso instead of blowing away VM_DIR.

Build-time qmp_session keeps stdin open briefly after the caller's
commands so migrate-set-capabilities is actually processed before
socat closes — without this, mapped-ram was silently a no-op.

CI workflow publishes .savevm.zst alongside the .qcow2 (optional asset;
CLI falls back to cold boot when missing). Test + verify steps go
through the CLI now that ISO generation is owned by packages/stack-cli.
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
stack-auth-hosted-components Ready Ready Preview, Comment Apr 20, 2026 7:03pm
stack-backend Ready Ready Preview, Comment Apr 20, 2026 7:03pm
stack-dashboard Ready Ready Preview, Comment Apr 20, 2026 7:03pm
stack-demo Ready Ready Preview, Comment Apr 20, 2026 7:03pm
stack-docs Ready Ready Preview, Comment Apr 20, 2026 7:03pm
stack-preview-backend Ready Ready Preview, Comment Apr 20, 2026 7:03pm
stack-preview-dashboard Ready Ready Preview, Comment Apr 20, 2026 7:03pm

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 15, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds QEMU snapshot capture/resume and a Node-based emulator CLI: CI builds/caches QEMU from source, runtime ISO creation, QMP-based VM-state capture/restore, guest-agent secret rotation, updated emulator start/resume flows, and new ISO/CLI tests.

Changes

Cohort / File(s) Summary
CI / QEMU build
\.github/workflows/qemu-emulator-build.yaml
Add Node 22/pnpm setup, build/cache QEMU 10.2.2 from source, switch lifecycle commands to packages/stack-cli Node CLI, and publish optional .savevm.zst release assets.
Emulator snapshot capture
docker/local-emulator/qemu/build-image.sh, docker/local-emulator/qemu/cloud-init/emulator/user-data
Introduce EMULATOR_BUILD_SNAPSHOT flow: pin vCPUs, deterministic placeholder secrets, QMP helpers to migrate VM state to state.raw, compress to .savevm.zst, and cloud-init changes to wait/block for snapshot capture.
Emulator resume & runtime
docker/local-emulator/qemu/run-emulator.sh, docker/local-emulator/qemu/*
Add snapshot resume path with .savevm.zst decompression, -incoming defer + migrate-incoming/cont via QMP, device/topology adjustments, QGA hooks for secret rotation, overlay fingerprinting, and cold-boot fallback.
Secret rotation & env overlays
docker/local-emulator/rotate-secrets.sh, docker/local-emulator/entrypoint.sh, docker/local-emulator/run-cron-jobs.sh, docker/server/entrypoint.sh
New rotate-secrets.sh validates/writes rotated env and patches bundles/DB; entrypoints now preserve or source rotated secrets and use a sentinel to avoid repeated sentinel replacements; cron sources rotated secrets when present.
Supervisor / Docker image
docker/local-emulator/supervisord.conf, docker/local-emulator/Dockerfile, docker/local-emulator/entrypoint.sh
Expose supervisor unix socket for supervisorctl, enable process-group stop/kill for programs, copy/make rotate-secrets.sh executable, and only generate CRON_SECRET when unset.
Stack CLI: downloads, ISO, preflight
packages/stack-cli/src/commands/emulator.ts, packages/stack-cli/package.json, packages/stack-cli/src/commands/emulator.test.ts
Replace gh CLI parsing with GitHub REST helpers, streaming asset downloads with progress, optional snapshot handling, runtime ISO generation, preflight binary checks, and add tests.
ISO builder & tests
packages/stack-cli/src/lib/iso.ts, packages/stack-cli/src/lib/iso.test.ts, packages/stack-cli/vitest.config.ts
Add ISO-9660+Joliet builder (buildIso/writeIso) and comprehensive tests; vitest config sets thread limits; add extract-zip dep and test script.
Misc / small edits
docker/local-emulator/run-cron-jobs.sh, .gitignore, docker/local-emulator/rotate-secrets.sh
Cron job script sources rotated secrets if present; add .claude/scheduled_tasks.lock to .gitignore; add new rotate script.

Sequence Diagram(s)

sequenceDiagram
    participant Host as Host (build-image.sh)
    participant QEMU as QEMU VM
    participant QMP as QMP Socket
    participant Disk as Disk (state.raw)
    participant ZStd as Zstd

    Host->>QEMU: Launch VM for provisioning
    QEMU->>Host: Boot & provision (cloud-init)
    Note over QEMU: Start services, emit STACK_SERVICES_READY
    Host->>QMP: Set migration caps (mapped-ram, multifd)
    Host->>QMP: migrate to file:state.raw
    QMP->>Disk: Write VM state
    Host->>ZStd: Compress state.raw -> .savevm.zst
    Host->>Host: Persist snapshot artifact
Loading
sequenceDiagram
    participant CLI as stack-cli
    participant GH as GitHub API
    participant CDN as Release CDN
    participant Host as Host (run-emulator.sh)
    participant ZStd as Zstd
    participant QMP as QMP Socket
    participant QEMU as QEMU VM
    participant QGA as QGA Guest Agent

    CLI->>GH: Query releases & assets
    GH-->>CLI: Return qcow2 + optional .savevm.zst metadata
    CLI->>CDN: Download qcow2 (streaming)
    CLI->>CDN: Download optional .savevm.zst (or skip)
    Host->>ZStd: Decompress .savevm.zst -> state.raw
    Host->>QEMU: Launch with -incoming defer
    Host->>QMP: migrate-incoming file:state.raw
    QMP->>QEMU: Restore state
    Host->>QMP: cont
    QEMU->>QGA: Guest agent becomes available
    Host->>QGA: trigger-fast-rotate (guest-exec) -> rotate secrets
    QGA->>Host: Rotation complete, emulator ready
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Suggested reviewers

  • aadesh18

Poem

🐇 I hid a snapshot in a zst so neat,

Frozen secrets napping, silent heartbeat.
The CLI stitched an ISO, set ports to play,
QMP hummed softly while the host ran away.
Hop—fast-start woke the VM, and devs cheered all day.

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 25.61% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fast-start local emulator via RAM snapshot + live secret rotation' directly and clearly summarizes the main feature additions: RAM snapshot-based fast startup and per-install secret rotation.
Description check ✅ Passed The pull request provides a comprehensive description with clear sections explaining the overall changes, implementation details, CLI modifications, and test coverage, fully matching the repository template structure.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch local-emulator-qol-fixes

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 15, 2026

Greptile Summary

This PR adds a RAM snapshot fast-start path to stack emulator start, reducing startup from 30–120 s to ~5–7 s. It captures a fully-warm QEMU VM state via QMP mapped-ram migration at build time, ships the state as a .savevm.zst asset alongside the existing .qcow2, and resumes from it at runtime with per-install secret rotation through QEMU Guest Agent.

  • P1 (emulator.ts line 524): The --pr/--run pull path searches for a non-existent artifact qemu-emulator-<arch>-savevm; CI only produces one artifact (qemu-emulator-<arch>) containing both files. The .savevm.zst is already extracted by the main download, but users always see "Snapshot artifact not available … fast-start disabled." — fix by replacing the second download call with existsSync(snapshotDest).

Confidence Score: 4/5

Safe to merge after fixing the snapshot artifact name mismatch in the --run/--pr pull path.

One P1 bug: users who pull via --pr or --run get a misleading 'fast-start disabled' message even though the snapshot is present (the .savevm.zst is already extracted by the main artifact download). Functionally the emulator still works, but the false-negative log misleads users testing this feature. All other changes are well-structured, thoroughly tested, and include robust cold-boot fallback paths.

packages/stack-cli/src/commands/emulator.ts (lines 521–532): snapshot artifact name mismatch in the --run/--pr download path.

Important Files Changed

Filename Overview
packages/stack-cli/src/commands/emulator.ts Major refactor: drops gh CLI dependency in favour of direct GitHub REST API calls, adds native progress download, ISO generation, and dependency preflight; P1 bug in --run/--pr path where snapshot lookup uses a non-existent artifact name, causing misleading "fast-start disabled" message even when the file is already present.
docker/local-emulator/qemu/run-emulator.sh Core snapshot-resume implementation: decompresses .zst cache, launches QEMU with -incoming defer, drives QMP for mapped-ram restore, then rotates secrets via QGA; well-structured with cold-boot fallback on every failure path; minor: qga.sock not cleaned up in stop_vm.
docker/local-emulator/qemu/build-image.sh Adds QMP-driven snapshot capture (stop → migrate → poll → quit → zstd compress); device topology carefully documented; pinned SMP=4 for migration compat; error paths clean up and exit correctly.
docker/local-emulator/rotate-secrets.sh New script: validates fresh secrets as 64-char hex, sed-replaces placeholder PCK in built JS, updates ApiKeySet via psql, and restarts supervisord programs; SQL literals are safe (hex-only inputs validated above); header comment is stale.
packages/stack-cli/src/lib/iso.ts Clean native ISO 9660 + Joliet writer replacing hdiutil/mkisofs; handles empty files, multi-sector files, volume identifiers, and both-endian fields correctly; well-tested.
.github/workflows/qemu-emulator-build.yaml Adds zstd, builds stack-cli, packages both .qcow2 and .savevm.zst in one artifact, and publishes snapshot assets to GitHub Releases.

Sequence Diagram

sequenceDiagram
    participant CLI as stack-cli
    participant Shell as run-emulator.sh
    participant QEMU as QEMU VM
    participant QGA as qemu-guest-agent
    participant Container as stack container

    CLI->>Shell: emulator start (generates runtime ISO)
    Shell->>Shell: decompress .savevm.zst → .savevm.raw (one-time)
    Shell->>QEMU: launch with -incoming defer
    Shell->>QEMU: QMP: migrate-set-capabilities (mapped-ram + multifd)
    Shell->>QEMU: QMP: migrate-incoming file:.savevm.raw
    Shell->>QEMU: QMP: cont
    QEMU-->>Shell: status=running (~2–5s)
    Shell->>QGA: guest-sync (wait for agent)
    Shell->>QGA: guest-exec trigger-fast-rotate (base64 secrets via input-data)
    QGA->>Container: docker exec -e PCK/SSK/SAK/CRON_SECRET rotate-secrets
    Container->>Container: sed JS files (replace placeholder PCK)
    Container->>Container: psql UPDATE ApiKeySet
    Container->>Container: supervisorctl restart stack-app cron-jobs
    Container-->>Shell: exit 0
    Shell->>Shell: wait_for_condition (health checks)
    Shell-->>CLI: ready (~5–7s total)
Loading
Prompt To Fix All With AI
This is a comment left during a code review.
Path: packages/stack-cli/src/commands/emulator.ts
Line: 521-532

Comment:
**Wrong artifact name for snapshot lookup**

CI produces a single artifact `qemu-emulator-<arch>` containing **both** `.qcow2` and `.savevm.zst`, but this code searches for a non-existent separate artifact `qemu-emulator-<arch>-savevm`. `downloadArtifactByName` returns `false`, so `snapshotDownloaded` stays `false` and the user always sees "Snapshot artifact not available … fast-start disabled" — even though the `.savevm.zst` was already extracted to `imageDir` by the main download above.

The fix is to drop the second download call and instead check whether the file landed on disk:

```suggestion
        // The savevm snapshot was packaged inside the same artifact as the qcow2;
        // both files land in imageDir after extraction above.
        if (existsSync(snapshotDest)) {
          console.log(`Downloaded: ${snapshotDest}`);
        } else {
          console.log(`Snapshot not present in run ${runId}; fast-start disabled.`);
        }
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: docker/local-emulator/qemu/run-emulator.sh
Line: 602-603

Comment:
**`qga.sock` not cleaned up on stop**

`stop_vm` removes `monitor.sock`, `serial.log`, and `runtime-config.iso`, but the new `qga.sock` (added for snapshot-resume QGA comms) is left behind. The socket is functionally harmless at the next start (QEMU recreates it with `server=on`), but stale sockets in `VM_DIR` can mislead `qga_wait_ready` if the code is invoked before the new QEMU process binds.

```suggestion
  rm -f "$VM_DIR/qemu.pid" "$VM_DIR/monitor.sock" "$VM_DIR/qga.sock" "$VM_DIR/serial.log"
  rm -f "$VM_DIR/runtime-config.iso"
```

How can I resolve this? If you propose a fix, please make it concise.

---

This is a comment left during a code review.
Path: docker/local-emulator/rotate-secrets.sh
Line: 4-6

Comment:
**Stale comment describes the old file-based secret delivery**

Lines 5–6 say "Host writes fresh secrets to `/host/stack-runtime/fresh-secrets.env`", but the current implementation never creates that file. Secrets now arrive exclusively via `docker exec -e` env vars (injected by `trigger-fast-rotate`). The `STACK_ROTATE_INPUT` fallback path at line 80 handles backward-compat, but the header comment will mislead future maintainers.

Suggest updating the description to match the actual flow:

```suggestion
# Called inside the stack container by the emulator snapshot-resume path.
# Fresh secrets arrive as environment variables (STACK_SEED_INTERNAL_PROJECT_*
# and CRON_SECRET), injected by the host via `docker exec -e` through
# trigger-fast-rotate. For backward compatibility, STACK_ROTATE_INPUT may
# point to a file as a fallback.
```

How can I resolve this? If you propose a fix, please make it concise.

Reviews (1): Last reviewed commit: "faster snapshot resume via mapped-ram + ..." | Re-trigger Greptile

Comment thread packages/stack-cli/src/commands/emulator.ts Outdated
Comment thread docker/local-emulator/qemu/run-emulator.sh Outdated
Comment thread docker/local-emulator/rotate-secrets.sh
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Nitpick comments (3)
packages/stack-cli/src/lib/iso.ts (1)

400-403: Consider adding error handling for writeFileSync.

The writeIso function directly calls writeFileSync without handling potential I/O errors. Per coding guidelines, avoid uncaught errors. Consider wrapping with appropriate error handling or documenting that callers must handle exceptions.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/stack-cli/src/lib/iso.ts` around lines 400 - 403, The writeIso
function calls writeFileSync directly and can throw uncaught I/O errors; wrap
the writeFileSync(path, buf) call in a try/catch inside writeIso and handle
errors by adding contextual information (include path and volumeId) and either
logging via the module/logger used in this package and rethrowing a new Error or
rethrowing the original error after attaching context, so callers get clear
failure details; keep buildIso(...) unchanged and ensure the catch preserves the
original stack (e.g., include err as cause or append err.message) so debugging
remains possible.
packages/stack-cli/src/commands/emulator.ts (2)

93-102: Validate the credentials payload instead of casting res.json().

Line 93 trusts the backend shape via a type assertion, so a malformed response becomes { project_id: undefined, ... } and the CLI will happily print/inject bad credentials. Parse unknown and assert the three required string fields before returning. As per coding guidelines, "Validate all assumptions either through the type system (preferred), assertions, or tests." and "Do NOT use as/any/type casts to bypass the type system unless explicitly asked by the user."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/stack-cli/src/commands/emulator.ts` around lines 93 - 102, The code
currently casts res.json() to a credential shape which can hide malformed
responses; change to parse the JSON as unknown, then validate that the result is
an object and that data.project_id, data.publishable_client_key, and
data.secret_server_key exist and are strings (and non-empty if desired) before
returning them; replace the type assertion on the `data` variable with a runtime
type guard/assertion that throws a descriptive error if validation fails so the
function that calls this code (the code around the `data` constant and its
return) never returns undefined credentials.

57-69: Use a monotonic clock for elapsed-time calculations.

Both the internal-PCK wait loop and the download progress math are derived from Date.now(), which can jump with wall-clock changes and skew timeouts/ETA. performance.now() is the safer fit in both places. As per coding guidelines, "Don't use Date.now() for measuring elapsed real time. Instead use performance.now()"

Also applies to: 304-313

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/stack-cli/src/commands/emulator.ts` around lines 57 - 69, The
readInternalPck() loop (and the download-progress ETA math elsewhere) currently
uses Date.now() for elapsed-time which can jump; replace those elapsed/timeout
calculations to use a monotonic clock via performance.now() (or Node's
perf_hooks.performance) instead: compute deadline as performance.now() +
timeoutMs and compare performance.now() in the loop, and update any ETA/progress
math that uses Date.now() to use performance.now() so timeout/duration
arithmetic stays stable; ensure units remain milliseconds and import or
reference the correct performance symbol used in this codebase.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/qemu-emulator-build.yaml:
- Around line 58-67: The workflow only installs pnpm and actions/setup-node when
matrix.arch == 'amd64', leaving arm64 builds to rely on the runner's
preinstalled Node; update the job matrix steps so that actions/setup-node@v4
(with node-version: 22 and cache: pnpm) and pnpm/action-setup@v4 (version:
10.23.0) run for both amd64 and arm64 (remove or adjust the if: matrix.arch ==
'amd64' guards around uses: pnpm/action-setup@v4 and uses:
actions/setup-node@v4), ensuring the arm64 build runs the same Node/pnpm setup
used by the amd64 job, including the later node usages in
docker/local-emulator/qemu/build-image.sh and the steps at lines 93-94.
- Around line 230-232: The failure-step "Print serial log on failure" is
triggering SC2086 because the serial log path uses an unquoted $HOME; update the
step's run command to quote the expanded path (i.e., use
"$HOME/.stack/emulator/run/vm/serial.log") so the shell does not perform
word-splitting or globbing; keep the rest of the redirection and fallback
(2>/dev/null || true) unchanged.

In `@docker/local-emulator/entrypoint.sh`:
- Around line 40-42: The CRON_SECRET assignment masks openssl failures because
the command is embedded in export; change it to run openssl rand separately,
check its exit status and exit on failure, then export the variable.
Specifically, in entrypoint.sh replace the inline export with: run openssl rand
-hex 32 into a temp/value variable (e.g., cron_val), if the command fails log an
error and exit non‑zero, otherwise export CRON_SECRET="$cron_val" so openssl
failures are not hidden.

In `@docker/local-emulator/qemu/run-emulator.sh`:
- Around line 691-700: The cold-boot fallback deletes the needed ISO because
stop_vm() always removes "$VM_DIR/runtime-config.iso"; change stop_vm to accept
an optional flag/parameter (e.g., preserve_runtime_config) and skip deleting
runtime-config.iso when that flag is true, then call stop_vm with
preserve_runtime_config=true from snapshot_fallback_to_cold_boot() before
calling cmd_start so ensure_runtime_config_iso can find the ISO; keep default
behavior of stop_vm unchanged for other callers.
- Around line 311-344: The snapshot branch removes the -virtfs hostfs mount so
the guest /host won't be exposed after resume; restore the hostfs virtfs option
into the snapshot resume path by adding the same -virtfs
"local,path=/,mount_tag=hostfs,security_model=none" entry to runtime_only_args
when snapshot_available is true (alongside where seed_phantom and bundle_phantom
are appended and runtime_only_args is modified), ensuring the mount_tag=hostfs
is present for resumed VMs so the CLI's publish-to-host workflow (used by
cloud-init/user-data and packages/stack-cli) continues to work.
- Around line 311-320: When a snapshot is available we must pin RAM to the
build-time size just like SMP is pinned: add a snapshot_ram variable (similar to
snapshot_smp) that defaults to EMULATOR_BUILD_RAM and, when snapshot_available
is true, override VM_RAM (or the -m argument) with that snapshot_ram so QEMU
restores with the same memory layout used at capture; update the snapshot branch
where snapshot_args and snapshot_smp are set (and where -m "$VM_RAM" is later
used) to use the snapshot_ram/EMULATOR_BUILD_RAM instead of the runtime
EMULATOR_RAM when resuming from a snapshot.

In `@docker/local-emulator/rotate-secrets.sh`:
- Around line 90-100: The UPDATE against the "ApiKeySet" table can silently
affect 0 rows; modify the rotate block that runs psql to perform the UPDATE ...
RETURNING id (or SELECT count(*) via GET DIAGNOSTICS) and capture its output,
then assert exactly one row was returned/updated and exit non‑zero (and log an
error) if not; specifically change the psql invocation around the UPDATE of
"ApiKeySet" (the hard-coded id '3142e763-b230-44b5-8636-aa62f7489c26' and the
environment keys STACK_SEED_INTERNAL_PROJECT_*) to return the updated id and
check in the shell that the result is non-empty (or equals one) before
proceeding to restart services.

In `@docker/server/entrypoint.sh`:
- Line 183: The find invocation uses an unquoted variable find $WORK_DIR/apps
... which can cause word splitting; update the command (the find line that
references WORK_DIR) to quote the variable as "$WORK_DIR" so the path is treated
as a single argument and shellcheck warnings are silenced.

In `@packages/stack-cli/src/commands/emulator.ts`:
- Around line 515-532: The code misreports snapshot availability because the
initial artifact extraction of `qemu-emulator-${arch}` may already create
`snapshotDest`; change the logic in the block around `downloaded`,
`snapshotDownloaded` and the `downloadArtifactByName` call so you first check if
`existsSync(snapshotDest)` immediately after the initial download/extract and
set `snapshotDownloaded = true` if present, and only call
`downloadArtifactByName(repo, runId, \`qemu-emulator-${arch}-savevm\`,
imageDir)` when `snapshotDest` does not exist; update the console messages to
reflect the actual condition (use `snapshotDest`, `dest`, `downloaded`,
`snapshotDownloaded`, `downloadArtifactByName`, `imageDir`, `runId`, and `repo`
to find the code).

---

Nitpick comments:
In `@packages/stack-cli/src/commands/emulator.ts`:
- Around line 93-102: The code currently casts res.json() to a credential shape
which can hide malformed responses; change to parse the JSON as unknown, then
validate that the result is an object and that data.project_id,
data.publishable_client_key, and data.secret_server_key exist and are strings
(and non-empty if desired) before returning them; replace the type assertion on
the `data` variable with a runtime type guard/assertion that throws a
descriptive error if validation fails so the function that calls this code (the
code around the `data` constant and its return) never returns undefined
credentials.
- Around line 57-69: The readInternalPck() loop (and the download-progress ETA
math elsewhere) currently uses Date.now() for elapsed-time which can jump;
replace those elapsed/timeout calculations to use a monotonic clock via
performance.now() (or Node's perf_hooks.performance) instead: compute deadline
as performance.now() + timeoutMs and compare performance.now() in the loop, and
update any ETA/progress math that uses Date.now() to use performance.now() so
timeout/duration arithmetic stays stable; ensure units remain milliseconds and
import or reference the correct performance symbol used in this codebase.

In `@packages/stack-cli/src/lib/iso.ts`:
- Around line 400-403: The writeIso function calls writeFileSync directly and
can throw uncaught I/O errors; wrap the writeFileSync(path, buf) call in a
try/catch inside writeIso and handle errors by adding contextual information
(include path and volumeId) and either logging via the module/logger used in
this package and rethrowing a new Error or rethrowing the original error after
attaching context, so callers get clear failure details; keep buildIso(...)
unchanged and ensure the catch preserves the original stack (e.g., include err
as cause or append err.message) so debugging remains possible.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 40549b75-ec4e-4b25-9d99-05d8519d737b

📥 Commits

Reviewing files that changed from the base of the PR and between 5341371 and 30dbdff.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (17)
  • .github/workflows/qemu-emulator-build.yaml
  • .gitignore
  • docker/local-emulator/Dockerfile
  • docker/local-emulator/entrypoint.sh
  • docker/local-emulator/qemu/build-image.sh
  • docker/local-emulator/qemu/cloud-init/emulator/user-data
  • docker/local-emulator/qemu/run-emulator.sh
  • docker/local-emulator/rotate-secrets.sh
  • docker/local-emulator/run-cron-jobs.sh
  • docker/local-emulator/supervisord.conf
  • docker/server/entrypoint.sh
  • packages/stack-cli/package.json
  • packages/stack-cli/src/commands/emulator.test.ts
  • packages/stack-cli/src/commands/emulator.ts
  • packages/stack-cli/src/lib/iso.test.ts
  • packages/stack-cli/src/lib/iso.ts
  • packages/stack-cli/vitest.config.ts

Comment thread .github/workflows/qemu-emulator-build.yaml
Comment thread .github/workflows/qemu-emulator-build.yaml Outdated
Comment thread docker/local-emulator/entrypoint.sh
Comment thread docker/local-emulator/qemu/run-emulator.sh Outdated
Comment thread docker/local-emulator/qemu/run-emulator.sh Outdated
Comment thread docker/local-emulator/qemu/run-emulator.sh
Comment thread docker/local-emulator/rotate-secrets.sh
Comment thread docker/server/entrypoint.sh
Comment thread packages/stack-cli/src/commands/emulator.ts Outdated
Ubuntu 24.04 (ubicloud-standard-8) ships QEMU 8.2, which predates the
mapped-ram migration capability used by the fast-resume snapshot path.
Compile 10.2.2 once per runner image and cache the resulting /opt/qemu
so subsequent runs are fast.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (2)
.github/workflows/qemu-emulator-build.yaml (2)

300-300: ⚠️ Potential issue | 🟡 Minor

Quote the serial-log path in the failure handler.

Unquoted $HOME still triggers SC2086 here. Use "$HOME/.stack/emulator/run/vm/serial.log" to avoid word-splitting and globbing.

Suggested change
-        run: tail -100 $HOME/.stack/emulator/run/vm/serial.log 2>/dev/null || true
+        run: tail -100 "$HOME/.stack/emulator/run/vm/serial.log" 2>/dev/null || true
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/qemu-emulator-build.yaml at line 300, The failure handler
uses an unquoted path in the tail command which triggers shellcheck SC2086;
update the run step string (the line containing the tail invocation) to quote
the expanded path, e.g. replace the unquoted
$HOME/.stack/emulator/run/vm/serial.log with
"$HOME/.stack/emulator/run/vm/serial.log" (and apply the same quoting to any
similar $HOME-based paths in that handler) so word-splitting and globbing are
avoided.

63-67: ⚠️ Potential issue | 🟠 Major

Install Node for the arm64 build path too.

The if: matrix.arch == 'amd64' guard leaves the arm64 build dependent on whatever Node happens to be preinstalled on the runner, even though Line 132 still runs node docker/local-emulator/generate-env-development.mjs for every matrix entry. That makes the arm64 path brittle as the runner image changes.

Suggested change
-      - uses: actions/setup-node@v4
-        if: matrix.arch == 'amd64'
+      - uses: actions/setup-node@v4
         with:
           node-version: 22
           cache: pnpm
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/qemu-emulator-build.yaml around lines 63 - 67, The
workflow only runs actions/setup-node@v4 when `if: matrix.arch == 'amd64'`,
leaving arm64 to rely on runner defaults while the job later runs `node
docker/local-emulator/generate-env-development.mjs`; remove or widen the `if`
guard so setup-node@v4 is executed for all matrix.arch entries (or add a
separate setup-node step for arm64) and keep `node-version: 22` and `cache:
pnpm` so both amd64 and arm64 use the pinned Node version and pnpm caching.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/workflows/qemu-emulator-build.yaml:
- Around line 169-179: The "Package image" step currently treats a missing
SAVEVM ("docker/local-emulator/qemu/images/stack-emulator-${{ matrix.arch
}}.savevm.zst") as a note; change it so that when matrix.arch == amd64 and the
SAVEVM file is missing the step fails (exit non‑zero) instead of continuing,
while preserving the current best‑effort behavior for other arches; update the
downstream smoke test logic (the conditional around the cold/fast boot switch)
to not silently fall back for amd64 if the artifact is absent so the CI run
fails early and surfaces the missing fast-resume snapshot.
- Around line 144-167: The workflow only runs emulator start/status/stop for
amd64 so arm64 snapshots can be published without a resume check; update the
emulator verification steps (the jobs named "Start emulator and verify", "Verify
services are healthy", and "Stop emulator" that run node
packages/stack-cli/dist/index.js emulator start|status|stop) to also run for
arm64 (change the if condition from matrix.arch == 'amd64' to include arm64,
e.g. matrix.arch == 'amd64' || matrix.arch == 'arm64' or remove the arch guard)
so arm64 gets the same resume smoke test, or alternatively add a guard in the
publish step to skip uploading the arm64 .savevm.zst unless a prior
resume/status check passed.
- Around line 90-105: In the "Build QEMU 10.2.2 from source" step, download the
matching .tar.xz.asc signature, import the QEMU release signing key (via gpg
--import or gpg --keyserver ...) and run gpg --verify /tmp/qemu.tar.xz.asc
/tmp/qemu.tar.xz; if verification fails exit non‑zero and do not extract; apply
the same signature download/import/verify flow at the other QEMU download
location as well so both build and test jobs validate the tarball before tar
-xf.

---

Duplicate comments:
In @.github/workflows/qemu-emulator-build.yaml:
- Line 300: The failure handler uses an unquoted path in the tail command which
triggers shellcheck SC2086; update the run step string (the line containing the
tail invocation) to quote the expanded path, e.g. replace the unquoted
$HOME/.stack/emulator/run/vm/serial.log with
"$HOME/.stack/emulator/run/vm/serial.log" (and apply the same quoting to any
similar $HOME-based paths in that handler) so word-splitting and globbing are
avoided.
- Around line 63-67: The workflow only runs actions/setup-node@v4 when `if:
matrix.arch == 'amd64'`, leaving arm64 to rely on runner defaults while the job
later runs `node docker/local-emulator/generate-env-development.mjs`; remove or
widen the `if` guard so setup-node@v4 is executed for all matrix.arch entries
(or add a separate setup-node step for arm64) and keep `node-version: 22` and
`cache: pnpm` so both amd64 and arm64 use the pinned Node version and pnpm
caching.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a447ac40-4b80-485b-9549-6f893e4754da

📥 Commits

Reviewing files that changed from the base of the PR and between 30dbdff and 6021a04.

📒 Files selected for processing (1)
  • .github/workflows/qemu-emulator-build.yaml

Comment thread .github/workflows/qemu-emulator-build.yaml
Comment thread .github/workflows/qemu-emulator-build.yaml
Comment thread .github/workflows/qemu-emulator-build.yaml Outdated
Copy link
Copy Markdown

@vercel vercel Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional Suggestion:

The stop_vm() function deletes runtime-config.iso, causing snapshot_fallback_to_cold_boot() to fail when it tries to restart the VM.

Fix on Vercel

Comment thread packages/stack-cli/src/commands/emulator.ts Outdated
Switch the CLI build step from `pnpm --filter @stackframe/stack-cli run
build` to `turbo run build --filter=@stackframe/stack-cli...` so that
stack-cli's workspace dependencies (@stackframe/js and
@stackframe/stack-shared) also get compiled to their dist/ outputs.
Without them, `node dist/index.js` fails with ERR_MODULE_NOT_FOUND at
import time.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
.github/workflows/qemu-emulator-build.yaml (1)

216-237: Consider extracting the QEMU build into a composite action.

The QEMU cache restore + build-from-source logic is duplicated between the build and test jobs. Extracting it into a composite action or reusable workflow would reduce duplication and ensure both jobs stay in sync.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In @.github/workflows/qemu-emulator-build.yaml around lines 216 - 237, Extract
the duplicated QEMU restore+build steps into a reusable composite action (e.g.,
.github/actions/qemu-build) or a reusable workflow and replace the inline blocks
in both the build and test jobs with a single call; move the actions/cache@v4
restore step and the shell build script (the block under the "Build QEMU 10.2.2
from source" step that references /opt/qemu, key qemu-10.2.2-${{ runner.os
}}-${{ runner.arch }}-v1, and the configure/make/install sequence) into the
composite action, expose inputs for version, prefix, target-list and cache key,
and return/cache-hit as an output so both jobs can consume the same behavior
without duplication.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In @.github/workflows/qemu-emulator-build.yaml:
- Around line 216-237: Extract the duplicated QEMU restore+build steps into a
reusable composite action (e.g., .github/actions/qemu-build) or a reusable
workflow and replace the inline blocks in both the build and test jobs with a
single call; move the actions/cache@v4 restore step and the shell build script
(the block under the "Build QEMU 10.2.2 from source" step that references
/opt/qemu, key qemu-10.2.2-${{ runner.os }}-${{ runner.arch }}-v1, and the
configure/make/install sequence) into the composite action, expose inputs for
version, prefix, target-list and cache key, and return/cache-hit as an output so
both jobs can consume the same behavior without duplication.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 24bafb7a-caa3-4fc2-872e-04f655beca5f

📥 Commits

Reviewing files that changed from the base of the PR and between 6021a04 and 0c0d726.

📒 Files selected for processing (1)
  • .github/workflows/qemu-emulator-build.yaml

BilalG1 added 3 commits April 15, 2026 13:59
First downloadArtifactByName already extracts both qcow2 and savevm.zst
from the single qemu-emulator-${arch} artifact; the second lookup for a
nonexistent -savevm artifact always failed and produced a misleading
'fast-start disabled' message.
The docker/server image runs as the unprivileged `node` user, which
cannot write to /var/run. With `set -e` at the top of the script, the
failed `touch` aborted execution after sentinel replacement but before
the backend/dashboard were started — the Check server health CI step
then saw connection refused on ports 8101/8102.

Move the marker into $WORK_DIR (which is already created and owned by
the running user). The emulator snapshot-resume path still benefits: the
marker persists across supervisorctl restarts because $WORK_DIR lives
on the container filesystem.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (5)
docker/server/entrypoint.sh (1)

156-159: Consider using null-delimited pipeline for robustness.

The find | xargs chain without -print0/-0 can fail on filenames containing spaces or newlines. While the built app tree is unlikely to have such filenames, using null-delimited handling would silence the shellcheck warning and be more robust.

♻️ Suggested improvement
-  unhandled_sentinels=$(find "$WORK_DIR/apps" -type f -exec grep -l "STACK_ENV_VAR_SENTINEL" {} + | \
-    xargs grep -h "STACK_ENV_VAR_SENTINEL" | \
+  unhandled_sentinels=$(find "$WORK_DIR/apps" -type f -print0 | \
+    xargs -0 grep -l "STACK_ENV_VAR_SENTINEL" 2>/dev/null | \
+    xargs grep -h "STACK_ENV_VAR_SENTINEL" 2>/dev/null | \
     grep -o "STACK_ENV_VAR_SENTINEL[A-Z_]*" | \
     sort -u | grep -v "^STACK_ENV_VAR_SENTINEL$")

Alternatively, keep the current approach if you're confident the build output won't contain problematic filenames.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/server/entrypoint.sh` around lines 156 - 159, The pipeline that builds
unhandled_sentinels using find/xargs/grep is unsafe for filenames with
spaces/newlines; update the commands that produce and consume file lists (the
find invocation used to populate unhandled_sentinels and the xargs usage) to use
null-delimited output/inputs (e.g., find ... -print0 and xargs -0 or use grep
-Z) so filenames are handled robustly, and preserve the subsequent grep/sort
-u/grep -v processing to produce the same STACK_ENV_VAR_SENTINEL list.
packages/stack-cli/src/commands/emulator.ts (4)

620-622: Voided promise in cleanup path.

Per coding guidelines, promises should not be voided with .catch(() => {}). While this is a best-effort cleanup before process.exit(), consider at minimum logging the error for debugging purposes.

♻️ Suggested fix
           console.log("\nStopping emulator...");
           runEmulator("stop")
-            .catch(() => { /* best-effort stop */ })
+            .catch((err) => {
+              // Best-effort stop — log but don't block exit
+              console.warn(`Warning: failed to stop emulator cleanly: ${err instanceof Error ? err.message : err}`);
+            })
             .finally(() => process.exit(exitCode));
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/stack-cli/src/commands/emulator.ts` around lines 620 - 622, The
cleanup call currently voids errors from runEmulator("stop") with .catch(() =>
{}); change this to surface at least the error via logging (e.g., use the
existing logger or console to log the error) so failures aren't silently ignored
while preserving the finally that calls process.exit(exitCode); locate the
runEmulator("stop") invocation in emulator.ts and replace the empty catch with a
handler that logs the caught error (including context like "emulator stop
failed") before allowing the finally block to run.

57-68: Use performance.now() for elapsed time measurement.

Per coding guidelines, Date.now() should not be used for measuring elapsed real time. This timeout logic should use performance.now() instead.

♻️ Suggested fix
 async function readInternalPck(timeoutMs = 60_000): Promise<string> {
   const path = internalPckPath();
-  const deadline = Date.now() + timeoutMs;
+  const deadline = performance.now() + timeoutMs;
   let delay = 250;
-  while (Date.now() < deadline) {
+  while (performance.now() < deadline) {
     if (existsSync(path)) {
       const contents = readFileSync(path, "utf-8").trim();
       if (contents) return contents;
     }
     await new Promise((r) => setTimeout(r, delay));
     delay = Math.min(delay * 2, 2000);
   }
   throw new CliError(`Timed out waiting for emulator internal publishable client key at ${path}`);
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/stack-cli/src/commands/emulator.ts` around lines 57 - 68, In
readInternalPck replace use of Date.now() for deadline/loop checks with
performance.now(): compute deadline = performance.now() + timeoutMs and change
the while condition and any timestamps to compare performance.now() against the
deadline; update any related variables (e.g., the deadline calculation and loop
condition inside readInternalPck) and ensure the runtime provides performance
(import from 'perf_hooks' if needed) so elapsed-time measurement follows the
coding guideline.

174-174: Consider using STACK_ prefix for PORT_PREFIX.

Per coding guidelines, environment variables should be prefixed with STACK_ (or NEXT_PUBLIC_STACK_). The fallback to PORT_PREFIX without prefix may be intentional for backwards compatibility, but consider documenting this or migrating to STACK_PORT_PREFIX.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/stack-cli/src/commands/emulator.ts` at line 174, The environment
variable lookup for portPrefix currently reads process.env.PORT_PREFIX and
process.env.NEXT_PUBLIC_STACK_PORT_PREFIX; update it to prefer the STACK_ names
by checking process.env.STACK_PORT_PREFIX then
process.env.NEXT_PUBLIC_STACK_PORT_PREFIX, optionally retaining
process.env.PORT_PREFIX only as a final fallback for backwards compatibility (or
remove it if migrating). Modify the initializer that sets portPrefix (the const
portPrefix assignment) to reflect the new precedence and add a brief inline
comment noting the compatibility fallback if you keep process.env.PORT_PREFIX.

304-312: Use performance.now() for download timing.

The download speed calculation uses Date.now() for elapsed time measurement, which should be performance.now() per coding guidelines.

♻️ Suggested fix
-  const startedAt = Date.now();
+  const startedAt = performance.now();
   let downloaded = 0;
   let lastRender = 0;

   const render = (final: boolean) => {
-    const now = Date.now();
+    const now = performance.now();
     if (!final && now - lastRender < 100) return;
     lastRender = now;
-    const elapsed = Math.max(0.001, (now - startedAt) / 1000);
+    const elapsed = Math.max(0.001, (now - startedAt) / 1000);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@packages/stack-cli/src/commands/emulator.ts` around lines 304 - 312, The
timing logic in the download progress (startedAt, lastRender and render) uses
Date.now(); replace it with performance.now() so elapsed uses high-resolution
timing—set startedAt = performance.now(), compute now = performance.now() inside
render, keep the 100ms throttle and compute elapsed = Math.max(0.001, (now -
startedAt) / 1000). If this runs in Node and performance is not globally
available, import performance from 'perf_hooks' at the top of the module; update
references in the render function accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@docker/server/entrypoint.sh`:
- Around line 156-159: The pipeline that builds unhandled_sentinels using
find/xargs/grep is unsafe for filenames with spaces/newlines; update the
commands that produce and consume file lists (the find invocation used to
populate unhandled_sentinels and the xargs usage) to use null-delimited
output/inputs (e.g., find ... -print0 and xargs -0 or use grep -Z) so filenames
are handled robustly, and preserve the subsequent grep/sort -u/grep -v
processing to produce the same STACK_ENV_VAR_SENTINEL list.

In `@packages/stack-cli/src/commands/emulator.ts`:
- Around line 620-622: The cleanup call currently voids errors from
runEmulator("stop") with .catch(() => {}); change this to surface at least the
error via logging (e.g., use the existing logger or console to log the error) so
failures aren't silently ignored while preserving the finally that calls
process.exit(exitCode); locate the runEmulator("stop") invocation in emulator.ts
and replace the empty catch with a handler that logs the caught error (including
context like "emulator stop failed") before allowing the finally block to run.
- Around line 57-68: In readInternalPck replace use of Date.now() for
deadline/loop checks with performance.now(): compute deadline =
performance.now() + timeoutMs and change the while condition and any timestamps
to compare performance.now() against the deadline; update any related variables
(e.g., the deadline calculation and loop condition inside readInternalPck) and
ensure the runtime provides performance (import from 'perf_hooks' if needed) so
elapsed-time measurement follows the coding guideline.
- Line 174: The environment variable lookup for portPrefix currently reads
process.env.PORT_PREFIX and process.env.NEXT_PUBLIC_STACK_PORT_PREFIX; update it
to prefer the STACK_ names by checking process.env.STACK_PORT_PREFIX then
process.env.NEXT_PUBLIC_STACK_PORT_PREFIX, optionally retaining
process.env.PORT_PREFIX only as a final fallback for backwards compatibility (or
remove it if migrating). Modify the initializer that sets portPrefix (the const
portPrefix assignment) to reflect the new precedence and add a brief inline
comment noting the compatibility fallback if you keep process.env.PORT_PREFIX.
- Around line 304-312: The timing logic in the download progress (startedAt,
lastRender and render) uses Date.now(); replace it with performance.now() so
elapsed uses high-resolution timing—set startedAt = performance.now(), compute
now = performance.now() inside render, keep the 100ms throttle and compute
elapsed = Math.max(0.001, (now - startedAt) / 1000). If this runs in Node and
performance is not globally available, import performance from 'perf_hooks' at
the top of the module; update references in the render function accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 764ee52b-d2d7-4018-9826-9079a0127b7d

📥 Commits

Reviewing files that changed from the base of the PR and between 0c0d726 and cfdc882.

⛔ Files ignored due to path filters (1)
  • pnpm-lock.yaml is excluded by !**/pnpm-lock.yaml
📒 Files selected for processing (3)
  • docker/server/entrypoint.sh
  • packages/stack-cli/package.json
  • packages/stack-cli/src/commands/emulator.ts
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/stack-cli/package.json

- stop_vm no longer deletes runtime-config.iso; the CLI owns its
  lifecycle and the snapshot → cold-boot fallback needs it preserved
  (cmd_reset still wipes RUN_DIR for a full reset). Also sweeps qga.sock.
- Write internal-pck to \$VM_DIR on the host in snapshot mode. Cold boot
  publishes this via virtfs/9p; snapshot mode drops virtfs, so
  --config-file flows would otherwise hang. Handles both the rotation
  path (fresh PCK) and EMULATOR_NO_ROTATION (placeholder PCK).
- Pin RAM in snapshot mode to the build-time 4096 (overridable via
  EMULATOR_SNAPSHOT_RAM). Migration replay requires an identical -m
  value, same constraint as CPU count.
- Fail amd64 build when .savevm.zst is missing rather than shipping a
  cold-boot-only release silently. arm64 stays best-effort for now
  because it runs under TCG and can't be verified end-to-end.
- Install Node/pnpm on both arches. arm64 also runs
  generate-env-development.mjs, which otherwise relied on the runner
  image's preinstalled Node.
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
docker/local-emulator/qemu/run-emulator.sh (1)

690-700: Consider a more robust readiness check after rotation.

The sleep 1 workaround (line 694) addresses a race between supervisor sending SIGTERM and the old Node processes dying. This is pragmatic but could still race if the old processes take longer than 1 second to stop. A more robust approach would be to wait for the services to become unavailable first (indicating old processes stopped), then wait for them to become available again.

💡 Optional: Wait for services to go down before checking up
       if ! qga_trigger_fast_rotate; then
         warn "Failed to trigger rotate-secrets — falling back to cold boot."
         snapshot_fallback_to_cold_boot
         return
       fi

-      # Wait for the *new* backend (post-supervisor-restart) to actually be
-      # listening. all_ready may briefly return true against the OLD Node
-      # processes between when supervisor sends SIGTERM and when the children
-      # die; sleep a beat so we measure the real readiness.
-      sleep 1
+      # Wait for services to become unavailable (old processes stopping),
+      # then wait for them to come back up (new processes starting).
+      local down_deadline=$((SECONDS + 10))
+      while [ "$SECONDS" -lt "$down_deadline" ] && all_ready; do
+        sleep 0.2
+      done
       if ! wait_for_condition "rotated services" "$SNAPSHOT_READY_TIMEOUT" all_ready; then
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docker/local-emulator/qemu/run-emulator.sh` around lines 690 - 700, The
single-second sleep after supervisor rotation is racy; replace the blind sleep
with a two-step readiness check that first waits for services to become
unavailable (old processes stopped) and then waits for them to be available
again using wait_for_condition: call wait_for_condition (e.g., "services down")
with a short timeout polling !all_ready or an all_down helper until services are
confirmed down, then call wait_for_condition "rotated services"
"$SNAPSHOT_READY_TIMEOUT" all_ready as before; update the block around
wait_for_condition, all_ready, and SNAPSHOT_READY_TIMEOUT (and keep the existing
fallback to snapshot_fallback_to_cold_boot and tail_vm_logs) so we avoid the
race without a fixed sleep.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Nitpick comments:
In `@docker/local-emulator/qemu/run-emulator.sh`:
- Around line 690-700: The single-second sleep after supervisor rotation is
racy; replace the blind sleep with a two-step readiness check that first waits
for services to become unavailable (old processes stopped) and then waits for
them to be available again using wait_for_condition: call wait_for_condition
(e.g., "services down") with a short timeout polling !all_ready or an all_down
helper until services are confirmed down, then call wait_for_condition "rotated
services" "$SNAPSHOT_READY_TIMEOUT" all_ready as before; update the block around
wait_for_condition, all_ready, and SNAPSHOT_READY_TIMEOUT (and keep the existing
fallback to snapshot_fallback_to_cold_boot and tail_vm_logs) so we avoid the
race without a fixed sleep.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 325e1ebd-d1ce-4c10-967a-664e70b2d258

📥 Commits

Reviewing files that changed from the base of the PR and between cfdc882 and 2c8ad4c.

📒 Files selected for processing (2)
  • .github/workflows/qemu-emulator-build.yaml
  • docker/local-emulator/qemu/run-emulator.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/qemu-emulator-build.yaml

- run-emulator.sh: drop wait_for_condition poll interval from 1s to 0.2s
- emulator.ts: replace existsSync+readFileSync TOCTOU in readInternalPck
  with try/ENOENT; tighten initial backoff to 50ms; drop redundant
  mkdirSync in startEmulator; surface stop-failure on stderr instead of
  swallowing silently
- iso.ts: inline trivial buildRootDirRecordInVD wrapper
…ct start

- build-image.sh: move runtime.iso drive before netdev so its virtio-blk
  slot precedes virtio-net-pci, matching run-emulator.sh's resume argv.
  Previously migrate-incoming against CI's savevm hit a device-tree
  mismatch and only looked green because snapshot_fallback_to_cold_boot
  silently retried as cold boot.
- run-emulator.sh: drop early-return in ensure_runtime_config_iso so
  PORT_PREFIX/EMULATOR_*_PORT changes take effect on every start; the
  preserved ISO from a prior run would otherwise silently override the
  host-forward ports picked up by QEMU's netdev.
- common.sh: fix backslash-escaped JSON in capture_vm_state's migrate-
  timeout diagnostic; single-quoted printf was emitting literal
  backslashes, so QMP replied with a parse error instead of the real
  query-migrate status.
ensure_runtime_config_iso unconditionally fell through to make_iso_from_dir,
which still required hdiutil/mkisofs/genisoimage — the host dep the
lib/iso.ts TS writer was supposed to remove. The Smoke Test job doesn't
install genisoimage, so emulator start failed. CLI now sets
STACK_EMULATOR_CLI_WROTE_ISO=1 and the shell short-circuits when that flag
plus a non-empty ISO are present.
…-host ISO

Two bugs surfaced by end-to-end testing against a freshly-built qcow2:

1. $STACK_EMULATOR_CLI_WROTE_ISO was referenced unguarded under `set -u`,
   so any code path that didn't set it (capture, direct-shell) tripped
   `unbound variable` before reaching the early-return. Use :- default.

2. ensure_runtime_config_iso was overwriting cmd_capture's specialized
   empty-VM_DIR_HOST ISO with the host-dir variant. Since virtfs is
   detached in capture mode, run-stack-container then tried to publish
   internal-pck to /host/... and restart-looped stack.service, so no
   service ever became healthy and capture aborted after 240s. Previously
   masked by snapshot_fallback_to_cold_boot; 510ef38 fixed the fallback
   mask and exposed this. Skip regen when EMULATOR_CAPTURING_SNAPSHOT=1.
@github-actions github-actions Bot assigned BilalG1 and unassigned N2D4 Apr 17, 2026
Splits EMULATOR_BUILD_SNAPSHOT into two independent flags:

* EMULATOR_BUILD_SNAPSHOT (default 1) — bake placeholder PCK/SSK/SAK/
  CRON_SECRET into the qcow2 so runtime rotate-secrets can swap them
  per install. Cheap; no extra wall time.
* EMULATOR_CAPTURE_SAVEVM (default 0) — start the stack, wait for
  backend+dashboard health, then capture savevm.zst via QMP. Implies
  BUILD_SNAPSHOT.

CI never captures (snapshots aren't portable across KVM/HVF/TCG; users
capture locally on first `stack emulator pull`). The previous default
of capturing in CI was wasted work on amd64 and made arm64 fail —
wait-for-stack-ready couldn't bring the stack up under cross-arch TCG
inside its 600s budget, so cloud-final.service was marked failed.
Prisma's default interactive-transaction timeout is 5s, but under
cross-arch arm64 TCG in the emulator qcow2 build this single batch
(deleteMany + createMany for events + ipInfos) takes ~10s. Bump just
this call to 30s. Production (KVM/native) runs it in <1s, so the
looser bound only engages when the DB is genuinely slow. Per-call
option — no other transaction is affected.
runBulldozerPaymentsInit / paginatedIngress issues a single
$executeRaw per row that writes a JSONB payload. Under cross-arch
arm64 TCG in the qcow2 build it takes ~31s per row and Postgres
kills it with code 57014 (canceling statement due to statement
timeout). 120s covers the observed time with a ~4× safety margin.
@BilalG1 BilalG1 merged commit 37ee5ec into dev Apr 20, 2026
32 of 35 checks passed
@BilalG1 BilalG1 deleted the local-emulator-qol-fixes branch April 20, 2026 21:24
@promptless
Copy link
Copy Markdown
Contributor

promptless Bot commented Apr 20, 2026

Promptless prepared a documentation update related to this change.

Triggered by PR #1340

Updated emulator documentation with fast-start capability (~5-8s startup via VM snapshot resume vs 30-120s cold boot), new CLI options (--pr, --run, --skip-snapshot for pull command), and environment variables (STACK_EMULATOR_HOME, EMULATOR_NO_ROTATION).

Review: Document local emulator fast-start and CLI enhancements

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants